Credit Card Users Churn Prediction¶
Context¶
Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Objective¶
Customers leaving its credit card services would lead the bank to a loss, so the bank wants to analyze customer data, identify which customers are likely to leave, and understand why, so that it can improve in those areas. As a Data Scientist at Thera Bank, you need to explore the data provided, identify patterns, build a classification model to identify customers likely to churn, and provide actionable insights and recommendations that will help the bank improve its services so that customers do not give up their credit cards.
Data Description¶
- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
- Customer_Age: Age in Years
- Gender: The gender of the account holder
- Dependent_count: Number of dependents
- Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
- Marital_Status: Marital Status of the account holder
- Income_Category: Annual Income Category of the account holder
- Card_Category: Type of Card
- Months_on_book: Period of relationship with the bank
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
- Credit_Limit: Credit Limit on the Credit Card
- Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
- Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
- Total_Trans_Amt: Total Transaction Amount (Last 12 months)
- Total_Trans_Ct: Total Transaction Count (Last 12 months)
- Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to that in the 1st quarter
- Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to that in the 1st quarter
- Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
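Per the definitions above, Avg_Open_To_Buy should equal Credit_Limit minus Total_Revolving_Bal. A minimal consistency-check sketch, using a few illustrative row values (with the real data, the same check can be run on the full frame):

```python
import pandas as pd

# Illustrative sample rows only; not the full dataset
sample = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
    "Total_Revolving_Bal": [777, 864, 0],
    "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0],
})

# Open to Buy = credit limit minus the revolving balance carried over
derived = sample["Credit_Limit"] - sample["Total_Revolving_Bal"]
consistent = (derived == sample["Avg_Open_To_Buy"]).all()
```

If this holds across all rows, Avg_Open_To_Buy is redundant given the other two columns, which also explains the correlation patterns seen later.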
Exploratory Data Analysis¶
Import Library¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.dummy import DummyClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
RocCurveDisplay,
)
# To be used for data scaling and encoding
from sklearn.preprocessing import (
StandardScaler,
MinMaxScaler,
OneHotEncoder,
RobustScaler,
)
from sklearn.impute import SimpleImputer
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
#Read the dataset
df = pd.read_csv('BankChurners.csv')
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
Shape of Dataframe¶
df.shape
(10127, 21)
- There are 10127 rows and 21 columns
Info of Dataframe¶
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
- There are 6 categorical (object) columns; the remaining 15 are numerical
- CLIENTNUM can be dropped, as a unique identifier does not help in identifying any patterns
df = df.drop(['CLIENTNUM'], axis = 1)
df.head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
df.describe()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 | 10127.000000 |
| mean | 46.325960 | 2.346203 | 35.928409 | 3.812580 | 2.341167 | 2.455317 | 8631.953698 | 1162.814061 | 7469.139637 | 0.759941 | 4404.086304 | 64.858695 | 0.712222 | 0.274894 |
| std | 8.016814 | 1.298908 | 7.986416 | 1.554408 | 1.010622 | 1.106225 | 9088.776650 | 814.987335 | 9090.685324 | 0.219207 | 3397.129254 | 23.472570 | 0.238086 | 0.275691 |
| min | 26.000000 | 0.000000 | 13.000000 | 1.000000 | 0.000000 | 0.000000 | 1438.300000 | 0.000000 | 3.000000 | 0.000000 | 510.000000 | 10.000000 | 0.000000 | 0.000000 |
| 25% | 41.000000 | 1.000000 | 31.000000 | 3.000000 | 2.000000 | 2.000000 | 2555.000000 | 359.000000 | 1324.500000 | 0.631000 | 2155.500000 | 45.000000 | 0.582000 | 0.023000 |
| 50% | 46.000000 | 2.000000 | 36.000000 | 4.000000 | 2.000000 | 2.000000 | 4549.000000 | 1276.000000 | 3474.000000 | 0.736000 | 3899.000000 | 67.000000 | 0.702000 | 0.176000 |
| 75% | 52.000000 | 3.000000 | 40.000000 | 5.000000 | 3.000000 | 3.000000 | 11067.500000 | 1784.000000 | 9859.000000 | 0.859000 | 4741.000000 | 81.000000 | 0.818000 | 0.503000 |
| max | 73.000000 | 5.000000 | 56.000000 | 6.000000 | 6.000000 | 6.000000 | 34516.000000 | 2517.000000 | 34516.000000 | 3.397000 | 18484.000000 | 139.000000 | 3.714000 | 0.999000 |
- Age Range:
The age of customers in the dataset ranges from 26 to 73 years. The majority of customers fall between 41 and 52 years old, with the median age being 46. This indicates a relatively mature customer base.
- Credit Limit:
Credit limits vary widely from as low as $1,438.30 to as high as $34,516.00. The median credit limit is $4,549.00, with the upper quartile at $11,067.50. This shows a substantial range in customers' available credit, suggesting that the dataset includes both lower and higher credit limits.
- Monthly Activity:
Customers' total transaction amounts vary from $510 to $18,484, with a median of $3,899. The upper quartile is $4,741, indicating that a significant portion of customers has relatively high transaction activity. This suggests that spending behavior can vary significantly among customers.
- Revolving Balance:
The total revolving balance, which ranges from $0 to $2,517, has a median of $1,276. Most customers have revolving balances that are on the lower end, with the 75th percentile at $1,784. This may reflect either conservative spending or effective credit management by many customers.
- Utilization Ratio:
The average utilization ratio, which measures the proportion of available credit being used, ranges from 0.00 to 0.999. The median utilization ratio is 0.176, while the 75th percentile is 0.503. This indicates that while some customers use a significant portion of their available credit, many maintain relatively low utilization rates, which can be indicative of good credit management practices.
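The quartile figures quoted above come straight from describe(); as a minimal sketch (on a hypothetical handful of utilization values, not the full column), the same statistics can also be pulled with quantile():

```python
import pandas as pd

# Hypothetical sample of utilization ratios; with the real data, use df["Avg_Utilization_Ratio"]
util = pd.Series([0.0, 0.061, 0.105, 0.176, 0.311, 0.503, 0.76])

# 25th, 50th, and 75th percentiles, as reported by describe()
q1, median, q3 = util.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # spread of the middle 50% of customers
```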
df.describe(include = 'O')
| Attrition_Flag | Gender | Education_Level | Marital_Status | Income_Category | Card_Category | |
|---|---|---|---|---|---|---|
| count | 10127 | 10127 | 8608 | 9378 | 10127 | 10127 |
| unique | 2 | 2 | 6 | 3 | 6 | 4 |
| top | Existing Customer | F | Graduate | Married | Less than $40K | Blue |
| freq | 8500 | 5358 | 3128 | 4687 | 3561 | 9436 |
- The table above shows the count, the number of unique values, and the most frequent value (with its frequency) for each categorical column
Duplicate values¶
#check if there are duplicates values
df.duplicated().sum()
0
- There are no duplicate rows
Univariate Analysis¶
Target Variable distribution¶
target_counts = df['Attrition_Flag'].value_counts()
# Plotting the pie chart
plt.figure(figsize=(8, 6)) # Optional: set figure size
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(target_counts))))
plt.title('Distribution of Target Variable')
plt.show()
print('Total number of customers:', df['Attrition_Flag'].count())
print(df['Attrition_Flag'].value_counts())
Total number of customers: 10127
Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64
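The counts above show a noticeably imbalanced target: roughly 16% of customers are attrited, which is why SMOTE was imported earlier. A quick sketch of computing the class proportions from these counts:

```python
import pandas as pd

# Class counts reported above
counts = pd.Series({"Existing Customer": 8500, "Attrited Customer": 1627},
                   name="Attrition_Flag")

# Equivalent to df["Attrition_Flag"].value_counts(normalize=True) on the real data
proportions = counts / counts.sum()
churn_rate = proportions["Attrited Customer"]  # about 0.16, i.e. roughly 16% churned
```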
Categorical variables distribution¶
Gender¶
target_counts = df['Gender'].value_counts()
# Plotting the pie chart
plt.figure(figsize=(8, 6)) # Optional: set figure size
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired(range(len(target_counts))))
plt.title('Gender')
plt.show()
print('Total count:', df['Gender'].count())
print(df['Gender'].value_counts())
Total count: 10127
Gender
F    5358
M    4769
Name: count, dtype: int64
- There are more females than males in the dataset
Education level¶
# Create a count plot
sns.countplot(x='Education_Level', data=df)
# Customize the plot
plt.title('Education Levels')
plt.ylabel('Count')
plt.xlabel('Education Level')
# Show the plot
plt.show()
- Graduates form the largest education group in the dataset
Marital Status¶
# Create a count plot
sns.countplot(x='Marital_Status', data=df)
# Customize the plot
plt.title('Marital Status')
plt.ylabel('Count')
plt.xlabel('Marital Status')
# Show the plot
plt.show()
- Married customers form the largest marital-status group in the dataset
Income Category¶
# Customize the plot
plt.figure(figsize=(8, 6))
sns.countplot(x='Income_Category', data=df)
plt.title('Income Category')
plt.ylabel('Count')
plt.xlabel('Income Category')
# Rotate the x-axis labels to prevent overlap
plt.xticks(rotation=15)
# Show the plot
plt.show()
- Most of the customers are in the 'Less than $40K' income category
Card Category¶
# Create a count plot
sns.countplot(x='Card_Category', data=df)
# Customize the plot
plt.title('Card Category')
plt.ylabel('Count')
plt.xlabel('Card Category')
# Show the plot
plt.show()
- Most customers hold the 'Blue' card category
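Beyond the univariate counts above, attrition rates within each categorical level can be tabulated with pd.crosstab. A minimal sketch on toy data (not the real distribution; with the real frame, pass df["Gender"] and df["Attrition_Flag"]):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "Attrition_Flag": ["Existing Customer", "Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer", "Existing Customer"],
})

# Row-normalized crosstab: share of attrited vs existing customers within each gender
rates = pd.crosstab(toy["Gender"], toy["Attrition_Flag"], normalize="index")
```

This turns raw counts into within-group churn rates, which are easier to compare across levels of unequal size.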
BiVariate Analysis¶
Numerical variables by Target¶
# Customer Age
sns.displot( data=df, x='Customer_Age', hue='Attrition_Flag', bins=[26, 35, 45, 55, 75], multiple='stack')
plt.title('Customer Age')
plt.show()
# Dependent_count
sns.displot( data=df, x='Dependent_count', hue='Attrition_Flag', multiple='stack')
plt.title('Dependent count')
plt.show()
# Months_on_book
sns.displot( data=df, x='Months_on_book', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Months on book')
plt.show()
#Total_Relationship_Count
sns.displot( data=df, x='Total_Relationship_Count', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Relationship Count')
plt.show()
# Months_Inactive_12_mon
sns.displot( data=df, x='Months_Inactive_12_mon', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Months Inactive 12_mon Count')
plt.show()
#Contacts_Count_12_mon
sns.displot( data=df, x='Contacts_Count_12_mon', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Contacts 12_mon Count')
plt.show()
#Credit_Limit
sns.displot( data=df, x='Credit_Limit', hue='Attrition_Flag', bins=25, multiple='stack')
plt.title('Credit Limit')
plt.show()
#Total_Revolving_Bal
sns.displot( data=df, x='Total_Revolving_Bal', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Revolving Bal')
plt.show()
#Avg_Open_To_Buy
sns.displot( data=df, x='Avg_Open_To_Buy', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Avg Open To Buy')
plt.show()
#Total_Amt_Chng_Q4_Q1
sns.displot( data=df, x='Total_Amt_Chng_Q4_Q1', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Amt Chng Q4_Q1')
plt.show()
#Total_Trans_Amt
sns.displot( data=df, x='Total_Trans_Amt', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Trans Amt')
plt.show()
#Total_Trans_Ct
sns.displot( data=df, x='Total_Trans_Ct', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Trans Ct')
plt.show()
#Total_Ct_Chng_Q4_Q1
sns.displot( data=df, x='Total_Ct_Chng_Q4_Q1', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Total Ct Chng Q4_Q1')
plt.show()
#Avg_Utilization_Ratio
sns.displot( data=df, x='Avg_Utilization_Ratio', hue='Attrition_Flag', bins=15, multiple='stack')
plt.title('Avg Utilization Ratio')
plt.show()
sns.set(palette="nipy_spectral")
# pairplot creates its own figure, so a prior plt.figure call would only produce an empty extra figure
sns.pairplot(data=df, hue="Attrition_Flag", corner=True)
plt.show()
Data Preprocessing¶
Outliers¶
# Get all numerical columns
numerical_columns = df.select_dtypes(include=['number'])
# Melt the DataFrame to long format for box plotting
df_melted = df.melt(id_vars='Attrition_Flag', value_vars=numerical_columns.columns.tolist(), var_name='Num Variables', value_name='Count')
# Create a box plot for all numerical columns
plt.figure(figsize=(20, 10))
sns.boxplot(x='Num Variables', y='Count', hue='Attrition_Flag', data=df_melted)
plt.xticks(rotation=15)
# Customize the plot
plt.title('Box Plot of All Numerical Columns by Category', fontsize=16, fontweight='semibold')
plt.xlabel('Num Variables')
plt.ylabel('Count')
# Show the plot
plt.show()
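Beyond eyeballing the box plots, the standard 1.5×IQR whisker rule can count how many points each column flags. A hedged sketch on toy values (the helper name is illustrative, not from the original notebook):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# Toy series with one obvious high outlier
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
n_out = iqr_outlier_count(toy)
```

On the real data, applying this per numeric column (e.g. via df.select_dtypes(include="number").apply(iqr_outlier_count)) quantifies what the box plots show visually.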
Correlation¶
cm = numerical_columns.corr()
plt.figure(figsize = (14, 10))
sns.heatmap(cm, annot = True, cmap = 'viridis')
plt.show()
Insights:¶
- Strong Positive Correlation between Total_Trans_Ct and Total_Trans_Amt:
The highest correlation in the chart is between Total_Trans_Ct and Total_Trans_Amt with a value of 0.81. This suggests that the more transactions a customer makes, the higher their total transaction amount is.
- Credit_Limit and Avg_Open_To_Buy:
There is a strong positive correlation (0.62) between Credit_Limit and Avg_Open_To_Buy, indicating that customers with higher credit limits also tend to have more available credit.
- Avg_Utilization_Ratio and Total_Revolving_Bal:
A strong positive correlation (0.62) also exists between Avg_Utilization_Ratio and Total_Revolving_Bal, suggesting that as the revolving balance increases, the credit utilization ratio tends to increase as well.
- Credit_Limit and Avg_Utilization_Ratio:
There is a moderate negative correlation (-0.48) between Credit_Limit and Avg_Utilization_Ratio, indicating that customers with higher credit limits tend to have a lower utilization ratio.
- Total_Trans_Amt and Avg_Open_To_Buy:
A weak positive correlation (0.17) exists between Total_Trans_Amt and Avg_Open_To_Buy, which suggests a slight relationship between how much a customer spends and how much available credit they have.
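The pairs called out above can also be extracted programmatically by keeping the upper triangle of the correlation matrix and sorting by absolute value. A minimal sketch on toy columns (with the real data, replace the toy frame with df.select_dtypes(include="number")):

```python
import numpy as np
import pandas as pd

# Toy numeric frame: column b is nearly a linear function of a, c is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": x * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()
# Keep each unordered pair once (upper triangle, diagonal excluded), sort by |r|
pairs = (corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
             .stack()
             .abs()
             .sort_values(ascending=False))
top_pair = pairs.index[0]
```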
Null Values¶
df.isnull().sum()
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
- Education_Level and Marital_Status have missing values that need to be treated
df['Education_Level'] = df["Education_Level"].fillna('Unknown')
df['Marital_Status'] = df["Marital_Status"].fillna('Unknown')
df.isnull().sum()
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
- No more missing values
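The fillna calls above hard-code the 'Unknown' label; the SimpleImputer imported earlier can express the same idea in a pipeline-friendly way. A sketch on toy data (with the real frame, fit on df[["Education_Level", "Marital_Status"]]):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with missing categories, for illustration only
toy = pd.DataFrame({"Education_Level": ["Graduate", np.nan, "High School", np.nan]})

# Constant-fill imputer: equivalent to fillna('Unknown'), but reusable inside a Pipeline
imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
filled = imputer.fit_transform(toy)  # returns a 2D array of filled values
```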
#Check for unique values
for column in df.columns:
unique_values = df[column].unique()
print(f"Column '{column}' has {len(unique_values)} unique values:")
print(unique_values)
print("\n")
Column 'Attrition_Flag' has 2 unique values:
['Existing Customer' 'Attrited Customer']

Column 'Customer_Age' has 45 unique values.

Column 'Gender' has 2 unique values:
['M' 'F']

Column 'Dependent_count' has 6 unique values: [3 5 4 2 0 1]

Column 'Education_Level' has 7 unique values:
['High School' 'Graduate' 'Uneducated' 'Unknown' 'College' 'Post-Graduate' 'Doctorate']

Column 'Marital_Status' has 4 unique values:
['Married' 'Single' 'Unknown' 'Divorced']

Column 'Income_Category' has 6 unique values:
['$60K - $80K' 'Less than $40K' '$80K - $120K' '$40K - $60K' '$120K +' 'abc']

Column 'Card_Category' has 4 unique values:
['Blue' 'Gold' 'Silver' 'Platinum']

Column 'Months_on_book' has 44 unique values.
Column 'Total_Relationship_Count' has 6 unique values: [5 6 4 3 2 1]
Column 'Months_Inactive_12_mon' has 7 unique values: [1 4 2 3 6 0 5]
Column 'Contacts_Count_12_mon' has 7 unique values: [3 2 0 1 4 5 6]
Column 'Credit_Limit' has 6205 unique values.
Column 'Total_Revolving_Bal' has 1974 unique values.
Column 'Avg_Open_To_Buy' has 6813 unique values.
Column 'Total_Amt_Chng_Q4_Q1' has 1158 unique values.
Column 'Total_Trans_Amt' has 5033 unique values.
Column 'Total_Trans_Ct' has 126 unique values.
Column 'Total_Ct_Chng_Q4_Q1' has 830 unique values.
Column 'Avg_Utilization_Ratio' has 964 unique values.
(full value arrays for the continuous columns truncated for readability)
Feature Engineering¶
## Recoding the placeholder value 'abc' in Income_Category to 'Unknown'
df.loc[df['Income_Category'] == 'abc', 'Income_Category'] = 'Unknown'
df['Income_Category'].unique()
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
'$120K +', 'Unknown'], dtype=object)
df1 = df.copy()
df1.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.0 | NaN | NaN | NaN | 46.32596 | 8.016814 | 26.0 | 41.0 | 46.0 | 52.0 | 73.0 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.0 | NaN | NaN | NaN | 2.346203 | 1.298908 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| Education_Level | 10127 | 7 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 10127 | 4 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.0 | NaN | NaN | NaN | 35.928409 | 7.986416 | 13.0 | 31.0 | 36.0 | 40.0 | 56.0 |
| Total_Relationship_Count | 10127.0 | NaN | NaN | NaN | 3.81258 | 1.554408 | 1.0 | 3.0 | 4.0 | 5.0 | 6.0 |
| Months_Inactive_12_mon | 10127.0 | NaN | NaN | NaN | 2.341167 | 1.010622 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Contacts_Count_12_mon | 10127.0 | NaN | NaN | NaN | 2.455317 | 1.106225 | 0.0 | 2.0 | 2.0 | 3.0 | 6.0 |
| Credit_Limit | 10127.0 | NaN | NaN | NaN | 8631.953698 | 9088.77665 | 1438.3 | 2555.0 | 4549.0 | 11067.5 | 34516.0 |
| Total_Revolving_Bal | 10127.0 | NaN | NaN | NaN | 1162.814061 | 814.987335 | 0.0 | 359.0 | 1276.0 | 1784.0 | 2517.0 |
| Avg_Open_To_Buy | 10127.0 | NaN | NaN | NaN | 7469.139637 | 9090.685324 | 3.0 | 1324.5 | 3474.0 | 9859.0 | 34516.0 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | NaN | NaN | NaN | 4404.086304 | 3397.129254 | 510.0 | 2155.5 | 3899.0 | 4741.0 | 18484.0 |
| Total_Trans_Ct | 10127.0 | NaN | NaN | NaN | 64.858695 | 23.47257 | 10.0 | 45.0 | 67.0 | 81.0 | 139.0 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | NaN | NaN | NaN | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | NaN | NaN | NaN | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
# For dropping columns
columns_to_drop = [
"credit_limit",
"dependent_count",
"months_on_book",
"avg_open_to_buy",
"customer_age"
]
# For masking a particular value in a feature
column_to_mask_value = "income_category"
value_to_mask = "abc"
masked_value = "Unknown"
loss_func = "logloss"
# Test and Validation sizes
test_size = 0.2
val_size = 0.25
# Dependent Variable Value map
target_mapper = {"Attrited Customer": 1, "Existing Customer": 0}
cat_columns = df1.select_dtypes(include="object").columns.tolist()
df1[cat_columns] = df1[cat_columns].astype("category")
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64   
 4   Education_Level           10127 non-null  category
 5   Marital_Status            10127 non-null  category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  int64   
 10  Months_Inactive_12_mon    10127 non-null  int64   
 11  Contacts_Count_12_mon     10127 non-null  int64   
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB
Split Train, test¶
X = df1.drop(columns=["Attrition_Flag"])
y = df1["Attrition_Flag"].map(target_mapper)
# Splitting data into training, validation, and test sets (60/20/20):
# first we split the data into a temporary set and the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=test_size, random_state=1, stratify=y
)
# then we split the temporary set into the train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_size, random_state=1, stratify=y_temp
)
Check the Split¶
print(
"Training data shape: \n\n",
X_train.shape,
"\n\nTesting Data Shape: \n\n",
X_test.shape,
)
Training data shape: (6075, 19) Testing Data Shape: (2026, 19)
print("Training: \n", y_train.value_counts(normalize=True))
print("\n\nValidation: \n", y_val.value_counts(normalize=True))
print("\n\nTest: \n", y_test.value_counts(normalize=True))
Training:
Attrition_Flag
0    0.839342
1    0.160658
Name: proportion, dtype: float64

Validation:
Attrition_Flag
0    0.839092
1    0.160908
Name: proportion, dtype: float64

Test:
Attrition_Flag
0    0.839585
1    0.160415
Name: proportion, dtype: float64
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
# Building a function to standardize columns
def feature_name_standardize(df: pd.DataFrame):
df_ = df.copy()
df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
return df_
# Building a function to drop features
def drop_feature(df: pd.DataFrame, features: list = None):
    df_ = df.copy()
    if features:
        df_ = df_.drop(columns=features)
    return df_
# Building a function to treat an incorrect value in a feature
def mask_value(df: pd.DataFrame, feature: str = None, value_to_mask: str = None, masked_value: str = None):
    df_ = df.copy()
    if feature is not None and value_to_mask is not None and feature in df_.columns:
        df_[feature] = df_[feature].astype('object')
        df_.loc[df_[feature] == value_to_mask, feature] = masked_value
        df_[feature] = df_[feature].astype('category')
    return df_
# Building a custom imputer for categorical nulls
def impute_category_unknown(df: pd.DataFrame, fill_value: str = 'Unknown'):
    df_ = df.copy()
    for col in df_.select_dtypes(include='category').columns.tolist():
        df_[col] = df_[col].astype('object').fillna(fill_value).astype('category')
    return df_
# Building a custom data preprocessing class with fit and transform methods for standardizing column names
class FeatureNamesStandardizer(TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Returns dataframe with column names in lower case with underscores in place of spaces."""
X_ = feature_name_standardize(X)
return X_
# Building a custom data preprocessing class with fit and transform methods for dropping columns
class ColumnDropper(TransformerMixin):
def __init__(self, features: list):
self.features = features
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Given a list of columns, returns a dataframe without those columns."""
X_ = drop_feature(X, features=self.features)
return X_
# Building a custom data preprocessing class with fit and transform methods for custom value masking
class CustomValueMasker(TransformerMixin):
def __init__(self, feature: str, value_to_mask: str, masked_value: str):
self.feature = feature
self.value_to_mask = value_to_mask
self.masked_value = masked_value
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
def transform(self, X):
"""Return a dataframe with the required feature value masked as required."""
X_ = mask_value(X, self.feature, self.value_to_mask, self.masked_value)
return X_
# Building a custom class to one-hot encode using pandas
class PandasOneHot(TransformerMixin):
    def __init__(self, columns: list = None):
        self.columns = columns
    def fit(self, X, y=None):
        """Record the dummy columns produced on the training data so that
        validation/test frames can be aligned to the same columns."""
        self.columns_ = pd.get_dummies(X, columns=self.columns, drop_first=True).columns
        return self
    def transform(self, X):
        """Return a one-hot encoded dataframe aligned to the training-time columns."""
        X_ = pd.get_dummies(X, columns=self.columns, drop_first=True)
        return X_.reindex(columns=self.columns_, fill_value=0)
# Building a custom class to fill nulls with Unknown
class FillUnknown(TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
"""All SciKit-Learn compatible transformers and classifiers have the
same interface. `fit` always returns the same object."""
return self
    def transform(self, X):
        """Return a dataframe with categorical nulls filled with 'Unknown'."""
        X_ = impute_category_unknown(X, fill_value='Unknown')
        return X_
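One subtlety with `pd.get_dummies`: the dummy columns are derived from whatever categories happen to be present in the frame it is given, so encoding train and validation frames separately can produce mismatched columns. A minimal illustration (toy `card` column, not the real data):

```python
import pandas as pd

train = pd.DataFrame({"card": ["Blue", "Gold", "Blue", "Silver"]})
val = pd.DataFrame({"card": ["Blue", "Blue"]})  # 'Gold' and 'Silver' absent

# drop_first=True drops the alphabetically first category ('Blue') as baseline
train_enc = pd.get_dummies(train, drop_first=True)
val_enc = pd.get_dummies(val, drop_first=True)

print(train_enc.columns.tolist())  # ['card_Gold', 'card_Silver']
print(val_enc.columns.tolist())    # [] -- columns differ from train

# Reindexing against the training columns restores alignment
val_aligned = val_enc.reindex(columns=train_enc.columns, fill_value=0)
print(val_aligned.columns.tolist())  # ['card_Gold', 'card_Silver']
```

In this dataset every categorical level appears in all splits, so the issue does not bite here, but reindexing against the training columns is the safe pattern.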
# To Standardize feature names
feature_name_standardizer = FeatureNamesStandardizer()
X_train = feature_name_standardizer.fit_transform(X_train)
X_val = feature_name_standardizer.transform(X_val)
X_test = feature_name_standardizer.transform(X_test)
# To Drop unnecessary columns
column_dropper = ColumnDropper(features=columns_to_drop)
X_train = column_dropper.fit_transform(X_train)
X_val = column_dropper.transform(X_val)
X_test = column_dropper.transform(X_test)
# To Mask incorrect/meaningless value of a feature
value_masker = CustomValueMasker(
feature=column_to_mask_value, value_to_mask=value_to_mask, masked_value=masked_value
)
X_train = value_masker.fit_transform(X_train)
X_val = value_masker.transform(X_val)
X_test = value_masker.transform(X_test)
# To impute categorical Nulls to Unknown
cat_columns = X_train.select_dtypes(include="category").columns.tolist()
imputer = FillUnknown()
X_train[cat_columns] = imputer.fit_transform(X_train[cat_columns])
X_val[cat_columns] = imputer.transform(X_val[cat_columns])
X_test[cat_columns] = imputer.transform(X_test[cat_columns])
# To encode the data
one_hot = PandasOneHot()
X_train = one_hot.fit_transform(X_train)
X_val = one_hot.transform(X_val)
X_test = one_hot.transform(X_test)
# Scale the numerical columns
robust_scaler = RobustScaler(with_centering=False, with_scaling=True)
num_columns = [
"total_relationship_count",
"months_inactive_12_mon",
"contacts_count_12_mon",
"total_revolving_bal",
"total_amt_chng_q4_q1",
"total_trans_amt",
"total_trans_ct",
"total_ct_chng_q4_q1",
"avg_utilization_ratio",
]
X_train[num_columns] = pd.DataFrame(
robust_scaler.fit_transform(X_train[num_columns]),
columns=num_columns,
index=X_train.index,
)
X_val[num_columns] = pd.DataFrame(
robust_scaler.transform(X_val[num_columns]), columns=num_columns, index=X_val.index
)
X_test[num_columns] = pd.DataFrame(
robust_scaler.transform(X_test[num_columns]),
columns=num_columns,
index=X_test.index,
)
Model Building - Original Data¶
Metrics for evaluation¶
def get_metrics_score(model, train, test, train_y, test_y, threshold=0.5, flag=False, roc=True):
    """Return [train/test accuracy, recall, precision, F1, ROC-AUC] for a fitted model.
    Class labels are obtained by thresholding the predicted probabilities."""
    # predicted probabilities of the positive (attrited) class
    pred_train_proba = model.predict_proba(train)[:, 1]
    pred_test_proba = model.predict_proba(test)[:, 1]
    # class labels from thresholding the probabilities
    pred_train = (pred_train_proba > threshold).astype(int)
    pred_test = (pred_test_proba > threshold).astype(int)
    train_acc = accuracy_score(train_y, pred_train)
    test_acc = accuracy_score(test_y, pred_test)
    train_recall = recall_score(train_y, pred_train)
    test_recall = recall_score(test_y, pred_test)
    train_precision = precision_score(train_y, pred_train)
    test_precision = precision_score(test_y, pred_test)
    train_f1 = f1_score(train_y, pred_train)
    test_f1 = f1_score(test_y, pred_test)
    train_roc_auc = roc_auc_score(train_y, pred_train_proba)
    test_roc_auc = roc_auc_score(test_y, pred_test_proba)
    score_list = [
        train_acc,
        test_acc,
        train_recall,
        test_recall,
        train_precision,
        test_precision,
        train_f1,
        test_f1,
        train_roc_auc,
        test_roc_auc,
    ]
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1 on training set : ", train_f1)
        print("F1 on test set : ", test_f1)
        if roc:
            print("ROC-AUC Score on training set : ", train_roc_auc)
            print("ROC-AUC Score on test set : ", test_roc_auc)
    return score_list
# defining empty lists to store train and test results
model_names = []
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
roc_auc_train = []
roc_auc_test = []
cross_val_train = []
def add_score_model(model_name, score, cv_res):
"""Add scores to list so that we can compare all models score together"""
model_names.append(model_name)
acc_train.append(score[0])
acc_test.append(score[1])
recall_train.append(score[2])
recall_test.append(score[3])
precision_train.append(score[4])
precision_test.append(score[5])
f1_train.append(score[6])
f1_test.append(score[7])
roc_auc_train.append(score[8])
roc_auc_test.append(score[9])
cross_val_train.append(cv_res)
## for confusion matrix
def make_confusion_matrix(model, test_X, y_actual, labels=[1, 0]):
    """
    model : classifier to predict values of test_X
    test_X : test set
    y_actual : ground truth
    labels : class-label order for the matrix rows/columns
    """
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - Attrited", "Actual - Existing"],
        columns=["Predicted - Attrited", "Predicted - Existing"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annotations = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annotations = np.asarray(annotations).reshape(2, 2)
    plt.figure(figsize=(5, 3))
    sns.heatmap(df_cm, annot=annotations, fmt="", cmap="Blues").set(title="Confusion Matrix")
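The `labels=[1, 0]` argument fixes the row/column order so the attrited class (1) occupies the first row, matching the dataframe index above. A toy run showing how the ordering changes the matrix layout:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1])

# Default ordering puts label 0 first
print(metrics.confusion_matrix(y_true, y_pred))
# With labels=[1, 0], the positive (attrited) class occupies the first row,
# so the top-left cell is the true-positive count
cm = metrics.confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
```

Getting this ordering right matters because the heatmap's "Actual - Attrited" / "Predicted - Attrited" labels are assigned positionally.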
print(
"Training data shape: \n\n",
X_train.shape,
"\n\nValidation Data Shape: \n\n",
X_val.shape,
"\n\nTesting Data Shape: \n\n",
X_test.shape,
)
Training data shape: (6075, 27) Validation Data Shape: (2026, 27) Testing Data Shape: (2026, 27)
Build 5 models¶
- Bagging
- Random Forest Classifier
- Gradient Boosting
- Decision Tree Classifier
- Adaptive Boosting
models = [] # Empty list to store all the models
cv_results = []
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1, algorithm='SAMME')))
models.append(("Decisiontree", DecisionTreeClassifier(random_state=1)))
# For each model, run 10-fold stratified cross-validation with recall as the scoring metric
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1) # Setting number of splits equal to 10
cv_result = cross_val_score(estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold)
cv_results.append(cv_result)
model.fit(X_train, y_train)
model_score = get_metrics_score(model, X_train, X_val, y_train, y_val)
add_score_model(name, model_score, cv_result.mean())
print("Added all models!")
Added all models!
Compare 5 models¶
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of train CV score, then test recall
comparison_frame.sort_values(
    by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
).style.highlight_max(color="green", axis=0).highlight_min(color="orange", axis=0)
| Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | GBM | 0.817620 | 0.969712 | 0.969398 | 0.873975 | 0.874233 | 0.933260 | 0.931373 | 0.902646 | 0.901899 | 0.992689 | 0.989937 |
| 0 | Bagging | 0.785862 | 0.996049 | 0.954590 | 0.980533 | 0.822086 | 0.994802 | 0.887417 | 0.987616 | 0.853503 | 0.999899 | 0.978021 |
| 1 | Random forest | 0.770440 | 1.000000 | 0.959526 | 1.000000 | 0.812883 | 1.000000 | 0.926573 | 1.000000 | 0.866013 | 1.000000 | 0.983956 |
| 4 | Decisiontree | 0.754113 | 1.000000 | 0.937315 | 1.000000 | 0.806748 | 1.000000 | 0.804281 | 1.000000 | 0.805513 | 1.000000 | 0.884551 |
| 3 | Adaboost | 0.729560 | 0.942551 | 0.955084 | 0.753074 | 0.803681 | 0.871886 | 0.906574 | 0.808136 | 0.852033 | 0.978865 | 0.980718 |
- The best model is Gradient Boosting; the next best are Bagging and Random Forest, respectively
Cross-validation Result¶
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Models Comparison")
ax = fig.add_subplot(111)
plt.boxplot(cv_results)
ax.set_xticklabels(model_names)
plt.show()
Model Building - Oversampled data¶
print("Before OverSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy="minority", k_neighbors=10, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After OverSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After OverSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, counts of label 'Yes': 976
Before OverSampling, counts of label 'No': 5099

After OverSampling, counts of label 'Yes': 5099
After OverSampling, counts of label 'No': 5099

After OverSampling, the shape of train_X: (10198, 27)
After OverSampling, the shape of train_y: (10198,)
Train¶
models_over = []
# Appending models into the list
models_over.append(("Bagging OverSampling", BaggingClassifier(random_state=1)))
models_over.append(("Random forest OverSampling", RandomForestClassifier(random_state=1)))
models_over.append(("GBM OverSampling", GradientBoostingClassifier(random_state=1)))
models_over.append(("Adaboost OverSampling", AdaBoostClassifier(random_state=1, algorithm='SAMME')))
models_over.append(("Decision Tree OverSampling", DecisionTreeClassifier(random_state=1)))
for name, model in models_over:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=10
) # Setting number of splits equal to 10
cv_result_over = cross_val_score(
estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
)
cv_results.append(cv_result_over)
model.fit(X_train_over, y_train_over)
model_score_over = get_metrics_score(
model, X_train_over, X_val, y_train_over, y_val
)
add_score_model(name, model_score_over, cv_result_over.mean())
print("Adding Oversampling models Completed!")
Adding Oversampling models Completed!
Compare models¶
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="green", axis=0).highlight_min(color="orange", axis=0)
| Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | GBM OverSampling | 0.967445 | 0.970484 | 0.957552 | 0.975093 | 0.917178 | 0.966187 | 0.835196 | 0.970620 | 0.874269 | 0.995988 | 0.988800 |
| 8 | Adaboost OverSampling | 0.941948 | 0.934105 | 0.918559 | 0.947245 | 0.911043 | 0.922989 | 0.685912 | 0.934959 | 0.782609 | 0.982744 | 0.972768 |
| 6 | Random forest OverSampling | 0.980780 | 1.000000 | 0.956565 | 1.000000 | 0.895706 | 1.000000 | 0.843931 | 1.000000 | 0.869048 | 1.000000 | 0.985522 |
| 2 | GBM | 0.817620 | 0.969712 | 0.969398 | 0.873975 | 0.874233 | 0.933260 | 0.931373 | 0.902646 | 0.901899 | 0.992689 | 0.989937 |
| 5 | Bagging OverSampling | 0.962738 | 0.996960 | 0.943731 | 0.996862 | 0.861963 | 0.997058 | 0.802857 | 0.996960 | 0.831361 | 0.999969 | 0.973466 |
| 0 | Bagging | 0.785862 | 0.996049 | 0.954590 | 0.980533 | 0.822086 | 0.994802 | 0.887417 | 0.987616 | 0.853503 | 0.999899 | 0.978021 |
| 9 | Decision Tree OverSampling | 0.943519 | 1.000000 | 0.923001 | 1.000000 | 0.819018 | 1.000000 | 0.733516 | 1.000000 | 0.773913 | 1.000000 | 0.880980 |
| 1 | Random forest | 0.770440 | 1.000000 | 0.959526 | 1.000000 | 0.812883 | 1.000000 | 0.926573 | 1.000000 | 0.866013 | 1.000000 | 0.983956 |
| 4 | Decisiontree | 0.754113 | 1.000000 | 0.937315 | 1.000000 | 0.806748 | 1.000000 | 0.804281 | 1.000000 | 0.805513 | 1.000000 | 0.884551 |
| 3 | Adaboost | 0.729560 | 0.942551 | 0.955084 | 0.753074 | 0.803681 | 0.871886 | 0.906574 | 0.808136 | 0.852033 | 0.978865 | 0.980718 |
- After oversampling, the GBM, AdaBoost, and Random Forest models rank highest on validation recall, each improving on its counterpart trained on the original data
Model Building - Undersampled data¶
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 27)
After Under Sampling, the shape of train_y: (1952,)
Build Models¶
models_under = []
# Appending models into the list
models_under.append(("Bagging UnderSampling", BaggingClassifier(random_state=1)))
models_under.append(("Random forest UnderSampling", RandomForestClassifier(random_state=1)))
models_under.append(("GBM UnderSampling", GradientBoostingClassifier(random_state=1)))
models_under.append(("Adaboost UnderSampling", AdaBoostClassifier(random_state=1, algorithm='SAMME')))
models_under.append(("DecisionTree UnderSampling", DecisionTreeClassifier(random_state=1)))
for name, model in models_under:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=10, shuffle=True, random_state=1
) # Setting number of splits equal to 10
cv_result_under = cross_val_score(
estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
)
cv_results.append(cv_result_under)
model.fit(X_train_un, y_train_un)
model_score_under = get_metrics_score(model, X_train_un, X_val, y_train_un, y_val)
add_score_model(name, model_score_under, cv_result_under.mean())
print("Adding Undersampling models Completed!")
Adding Undersampling models Completed!
Compare Undersampling Models¶
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(
by=["Test_Recall", "Cross_Val_Score_Train"], ascending=False
).style.highlight_max(color="green", axis=0).highlight_min(color="orange", axis=0)
| Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | GBM UnderSampling | 0.951799 | 0.967725 | 0.938796 | 0.979508 | 0.957055 | 0.956957 | 0.739336 | 0.968101 | 0.834225 | 0.995357 | 0.989747 |
| 11 | Random forest UnderSampling | 0.935388 | 1.000000 | 0.928430 | 1.000000 | 0.932515 | 1.000000 | 0.711944 | 1.000000 | 0.807437 | 1.000000 | 0.979840 |
| 10 | Bagging UnderSampling | 0.920029 | 0.994365 | 0.924482 | 0.990779 | 0.932515 | 0.997936 | 0.698851 | 0.994344 | 0.798949 | 0.999701 | 0.972970 |
| 13 | Adaboost UnderSampling | 0.917968 | 0.928791 | 0.918559 | 0.933402 | 0.926380 | 0.924873 | 0.681716 | 0.929118 | 0.785436 | 0.979916 | 0.979626 |
| 7 | GBM OverSampling | 0.967445 | 0.970484 | 0.957552 | 0.975093 | 0.917178 | 0.966187 | 0.835196 | 0.970620 | 0.874269 | 0.995988 | 0.988800 |
| 8 | Adaboost OverSampling | 0.941948 | 0.934105 | 0.918559 | 0.947245 | 0.911043 | 0.922989 | 0.685912 | 0.934959 | 0.782609 | 0.982744 | 0.972768 |
| 6 | Random forest OverSampling | 0.980780 | 1.000000 | 0.956565 | 1.000000 | 0.895706 | 1.000000 | 0.843931 | 1.000000 | 0.869048 | 1.000000 | 0.985522 |
| 14 | DecisionTree UnderSampling | 0.896423 | 1.000000 | 0.891412 | 1.000000 | 0.886503 | 1.000000 | 0.612288 | 1.000000 | 0.724311 | 1.000000 | 0.889428 |
| 2 | GBM | 0.817620 | 0.969712 | 0.969398 | 0.873975 | 0.874233 | 0.933260 | 0.931373 | 0.902646 | 0.901899 | 0.992689 | 0.989937 |
| 5 | Bagging OverSampling | 0.962738 | 0.996960 | 0.943731 | 0.996862 | 0.861963 | 0.997058 | 0.802857 | 0.996960 | 0.831361 | 0.999969 | 0.973466 |
| 0 | Bagging | 0.785862 | 0.996049 | 0.954590 | 0.980533 | 0.822086 | 0.994802 | 0.887417 | 0.987616 | 0.853503 | 0.999899 | 0.978021 |
| 9 | Decision Tree OverSampling | 0.943519 | 1.000000 | 0.923001 | 1.000000 | 0.819018 | 1.000000 | 0.733516 | 1.000000 | 0.773913 | 1.000000 | 0.880980 |
| 1 | Random forest | 0.770440 | 1.000000 | 0.959526 | 1.000000 | 0.812883 | 1.000000 | 0.926573 | 1.000000 | 0.866013 | 1.000000 | 0.983956 |
| 4 | Decisiontree | 0.754113 | 1.000000 | 0.937315 | 1.000000 | 0.806748 | 1.000000 | 0.804281 | 1.000000 | 0.805513 | 1.000000 | 0.884551 |
| 3 | Adaboost | 0.729560 | 0.942551 | 0.955084 | 0.753074 | 0.803681 | 0.871886 | 0.906574 | 0.808136 | 0.852033 | 0.978865 | 0.980718 |
- After undersampling, the Gradient Boosting, Random Forest, and Bagging undersampled models outperform all other models on validation recall
- Best 3 models are as follows:
- Gradient Boosting UnderSampling
- Random Forest UnderSampling
- Bagging UnderSampling
Model Performance Improvement using Hyperparameter Tuning¶
Choice of 3 models that can be tuned¶
- Gradient Boosting OverSampling: Worth tuning to potentially enhance already strong performance metrics.
- Adaboost UnderSampling: Needs tuning to improve performance metrics; its accuracy is among the lowest of all models.
- Random Forest UnderSampling: Needs tuning due to potential overfitting (perfect training scores).
These models are selected based on their performance metrics and potential for improvement through tuning. This helps to optimize performance while avoiding overfitting and improving generalization to new data.
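The gradient-boosting grid used below contains 3 × 3 × 3 × 2 × 2 × 2 × 2 = 432 parameter combinations; RandomizedSearchCV with `n_iter=5` samples only 5 of them, trading exhaustiveness for speed. A quick check of that budget (grid values copied from the tuning cell below):

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 500],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'subsample': [0.8, 1.0],
    'max_features': ['sqrt', 'log2'],
}
# Size of the exhaustive grid a full GridSearchCV would have to evaluate
n_combinations = len(list(product(*param_grid.values())))
print(n_combinations)  # 432
# RandomizedSearchCV(n_iter=5, cv=3) instead fits only 5 x 3 = 15 models
```

Raising `n_iter` explores more of the grid at proportionally higher cost; 5 is a deliberately small budget here.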
Tuning - Gradient Boosting Oversampling¶
# defining model - Gradient Boosting Oversampling
model = GradientBoostingClassifier(random_state=1)
# Parameter grid
param_grid = {
'n_estimators': [50, 100, 500],
'learning_rate': [0.05, 0.1, 0.2],
'max_depth': [3, 5, 10],
'min_samples_split': [2, 5],
'min_samples_leaf': [1, 2],
'subsample': [0.8, 1.0],
'max_features': ['sqrt', 'log2']
}
# Scoring metric (recall)
scorer = metrics.make_scorer(recall_score)
# RandomizedSearchCV setup
gbm_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                               n_iter=5, scoring=scorer, cv=3, random_state=1, n_jobs=1)
# Fit the RandomizedSearchCV
gbm_tuned.fit(X_train_over, y_train_over)
# Output the best parameters and the best score
print("Best parameters found: ", gbm_tuned.best_params_)
print("Best CV recall score: ", gbm_tuned.best_score_)
Best parameters found: {'subsample': 1.0, 'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'learning_rate': 0.05}
Best CV recall score: 0.9786257198582788
# Create a new GradientBoostingClassifier with the best parameters
best_gbm = GradientBoostingClassifier(**gbm_tuned.best_params_, random_state=1)
# Fit the model on training data
best_gbm.fit(X_train_over, y_train_over)
GradientBoostingClassifier(learning_rate=0.05, max_depth=10,
                           max_features='sqrt', min_samples_leaf=2,
                           min_samples_split=5, n_estimators=500,
                           random_state=1)
gbm_tuned_model_score = get_metrics_score(best_gbm, X_train, X_val, y_train, y_val)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
gbm_over_cv = cross_val_score(estimator=best_gbm, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
add_score_model("Tuned GBM Over Sampling", gbm_tuned_model_score, gbm_over_cv.mean())
make_confusion_matrix(best_gbm, X_val, y_val)
Tuning - AdaBoost Undersampling¶
# Define the base model (AdaBoostClassifier)
model = AdaBoostClassifier(random_state=1, algorithm='SAMME')
# Parameter grid for AdaBoost (smaller ranges for faster tuning)
param_grid = {
'n_estimators': [50, 100, 500], # Reduced number of estimators
'learning_rate': [0.01, 0.1, 0.5, 1.0], # Focus on a few key values for learning rate
}
# Scoring metric (recall)
scorer = metrics.make_scorer(recall_score)
# RandomizedSearchCV setup (reduced n_iter and lower cv)
adaboost_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
n_iter=10, scoring=scorer, cv=3, random_state=1, n_jobs=1)
# Fit the RandomizedSearchCV
adaboost_tuned.fit(X_train_un, y_train_un)
# Output the best parameters and the best score
print("Best parameters found: ", adaboost_tuned.best_params_)
print("Best CV recall score: ", adaboost_tuned.best_score_)
Best parameters found: {'n_estimators': 500, 'learning_rate': 1.0}
Best CV recall score: 0.9364826175869121
# Create a new AdaBoostClassifier with the best parameters
best_ada = AdaBoostClassifier(**adaboost_tuned.best_params_, random_state=1, algorithm='SAMME')
# Fit the model on training data
best_ada.fit(X_train_un, y_train_un)
AdaBoostClassifier(algorithm='SAMME', n_estimators=500, random_state=1)
ada_tuned_model_score = get_metrics_score(best_ada, X_train, X_val, y_train, y_val)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
ada_down_cv = cross_val_score(estimator=best_ada, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
add_score_model("Tuned AdaBoost Under Sampling", ada_tuned_model_score, ada_down_cv.mean())
make_confusion_matrix(best_ada, X_val, y_val)
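`make_confusion_matrix` is a plotting helper defined earlier in the notebook. A minimal text-only equivalent that reports the same information (counts and percentages for each quadrant) could look like this hypothetical sketch:

```python
from sklearn.metrics import confusion_matrix

def print_confusion_matrix(model, X, y):
    """Print the 2x2 confusion matrix of a fitted binary classifier
    with counts and percentages of the total."""
    cm = confusion_matrix(y, model.predict(X))
    total = cm.sum()
    labels = [["TN", "FP"], ["FN", "TP"]]
    for i in range(2):
        for j in range(2):
            print(f"{labels[i][j]}: {cm[i, j]} ({cm[i, j] / total:.1%})")
    return cm
```

Since recall is the metric of interest here, the FN cell (churners predicted as existing customers) is the one to watch across models.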
Tuning - Random Forest Undersampling¶
# Define the model
model = RandomForestClassifier(random_state=1)
# Parameter grid
param_grid = {
'n_estimators': [50, 100, 150], # Limit the range for speed
'max_depth': [10, 20, 30, None], # Include None for unlimited depth
'min_samples_split': [2, 5, 10], # Focus on reasonable values
'min_samples_leaf': [1, 2, 4],
'max_features': ['log2', 'sqrt'], # Common options
'bootstrap': [True, False] # Bootstrap sampling (whether or not to sample with replacement)
}
# Scoring metric (recall)
scorer = metrics.make_scorer(recall_score)
# RandomizedSearchCV setup (reduced n_iter and lower cv)
rf_tuned = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
n_iter=10, scoring=scorer, cv=3, random_state=1, n_jobs=1)
# Fit the RandomizedSearchCV
rf_tuned.fit(X_train_un, y_train_un)
# Output the best parameters and the best score
print("Best parameters found: ", rf_tuned.best_params_)
print("Best CV recall score: ", rf_tuned.best_score_)
Best parameters found: {'n_estimators': 150, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': False}
Best CV recall score: 0.9364920560012585
# Create a new RandomForestClassifier with the best parameters
best_rf = RandomForestClassifier(**rf_tuned.best_params_, random_state=1)
# Fit the model on training data
best_rf.fit(X_train_un, y_train_un)
RandomForestClassifier(bootstrap=False, max_depth=30, min_samples_split=10,
                       n_estimators=150, random_state=1)
rf_tuned_model_score = get_metrics_score(best_rf, X_train, X_val, y_train, y_val)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scoring = "recall"
rf_cv = cross_val_score(estimator=best_rf, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
add_score_model("Tuned RandomForest Under Sampling", rf_tuned_model_score, rf_cv.mean())
make_confusion_matrix(best_rf, X_val, y_val)
comparison_frame = pd.DataFrame(
{
"Model": model_names,
"Cross_Val_Score_Train": cross_val_train,
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
"Train_F1": f1_train,
"Test_F1": f1_test,
"Train_ROC_AUC": roc_auc_train,
"Test_ROC_AUC": roc_auc_test,
}
)
# Optionally express the scores as whole-number percentages:
# for col in comparison_frame.select_dtypes(include="float64").columns.tolist():
#     comparison_frame[col] = (comparison_frame[col] * 100).round(0).astype(int)
comparison_frame.tail(3).sort_values(
by=["Cross_Val_Score_Train", "Test_Recall"], ascending=False
)
| | Model | Cross_Val_Score_Train | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Tuned GBM Over Sampling | 0.985096 | 1.000000 | 0.964462 | 1.000000 | 0.889571 | 1.000000 | 0.889571 | 1.000000 | 0.889571 | 1.000000 | 0.990971 |
| 17 | Tuned RandomForest Under Sampling | 0.944635 | 0.941564 | 0.931392 | 1.000000 | 0.947853 | 0.733283 | 0.716937 | 0.846121 | 0.816380 | 0.995619 | 0.981972 |
| 16 | Tuned AdaBoost Under Sampling | 0.938470 | 0.935473 | 0.936328 | 0.957992 | 0.953988 | 0.727061 | 0.731765 | 0.826702 | 0.828229 | 0.987684 | 0.989175 |
Performance of tuned models:¶
Tuned GBM Over Sampling:¶
- High Training and Test Scores: The model achieves perfect accuracy and recall on the training data and very high accuracy and recall on the test data, indicating strong performance on both datasets.
- Precision Trade-Off: Precision drops from a perfect 1.0 on the training set to about 0.89 on the test set, the clearest sign that the model has overfit the training data to some degree.
- ROC AUC: The ROC AUC scores are excellent, showing that the model has a high ability to distinguish between the classes.
Tuned RandomForest Under Sampling:¶
- High Recall on Training Set: The model achieves perfect recall on the training set but has slightly lower recall on the test set. This suggests the model is very good at identifying positive cases during training but slightly less effective on unseen data.
- Precision and F1 Scores: The precision and F1 scores are lower compared to the GBM model, indicating a trade-off between precision and recall.
- ROC AUC: Very high ROC AUC scores on both training and test data, indicating the model's strong discriminative power.
Tuned AdaBoost Under Sampling:¶
- Balanced Performance: The AdaBoost model shows relatively consistent performance across training and test datasets, with close accuracy, recall, precision, and F1 scores.
- High ROC AUC: The ROC AUC scores are also high, demonstrating strong performance in distinguishing between classes.
- Moderate Precision and Recall: The model has a good balance of precision and recall, which might be preferred depending on the application’s requirements.
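Since all three tuned models expose `predict_proba`, the precision/recall balance can also be adjusted after training by moving the decision threshold rather than retraining. As a hedged sketch (not part of the notebook's pipeline), the lowest-cost threshold that still meets a recall target on validation data can be found from the precision-recall curve:

```python
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(model, X_val, y_val, min_recall=0.95):
    """Return the highest decision threshold that still achieves
    min_recall on validation data, for a fitted binary classifier."""
    proba = model.predict_proba(X_val)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_val, proba)
    # thresholds has len(recall) - 1 entries; recall is non-increasing
    # as the threshold rises, so valid thresholds form a prefix
    ok = recall[:-1] >= min_recall
    if not ok.any():
        return 0.0  # even the lowest threshold cannot reach the target
    # the highest qualifying threshold gives the best precision
    return float(thresholds[ok].max())
```

A bank prioritizing churn capture might fix recall at, say, 95% and accept whatever precision that threshold yields, since a false alarm (a retention offer to a loyal customer) is cheaper than a missed churner.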
Model Performance Comparison and Final Model Selection¶
- Of all the tuned models, the GBM trained on oversampled data achieves the highest cross-validation score and test accuracy, so it is selected as the final model.
# Plot feature importances of the final model
feature_names = X_train.columns
importances = best_gbm.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
gbm_tuned_model_test_score = get_metrics_score(
best_gbm, X_train, X_test, y_train, y_test
)
final_model_names = ["GBM Tuned Over-sampled Trained"]
final_acc_train = [gbm_tuned_model_test_score[0]]
final_acc_test = [gbm_tuned_model_test_score[1]]
final_recall_train = [gbm_tuned_model_test_score[2]]
final_recall_test = [gbm_tuned_model_test_score[3]]
final_precision_train = [gbm_tuned_model_test_score[4]]
final_precision_test = [gbm_tuned_model_test_score[5]]
final_f1_train = [gbm_tuned_model_test_score[6]]
final_f1_test = [gbm_tuned_model_test_score[7]]
final_roc_auc_train = [gbm_tuned_model_test_score[8]]
final_roc_auc_test = [gbm_tuned_model_test_score[9]]
final_result_score = pd.DataFrame(
{
"Model": final_model_names,
"Train_Accuracy": final_acc_train,
"Test_Accuracy": final_acc_test,
"Train_Recall": final_recall_train,
"Test_Recall": final_recall_test,
"Train_Precision": final_precision_train,
"Test_Precision": final_precision_test,
"Train_F1": final_f1_train,
"Test_F1": final_f1_test,
"Train_ROC_AUC": final_roc_auc_train,
"Test_ROC_AUC": final_roc_auc_test,
}
)
for col in final_result_score.select_dtypes(include="float64").columns.tolist():
final_result_score[col] = final_result_score[col] * 100
final_result_score
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1 | Test_F1 | Train_ROC_AUC | Test_ROC_AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GBM Tuned Over-sampled Trained | 100.0 | 97.137216 | 100.0 | 92.923077 | 100.0 | 89.614243 | 100.0 | 91.238671 | 100.0 | 99.324922 |
- Performance on the held-out test data remains strong: about 97% accuracy and 93% recall.
make_confusion_matrix(best_gbm, X_test, y_test)
y_pred_prob = best_gbm.predict_proba(X_test)[:, 1] # Probability estimates for the positive class
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
roc_auc = metrics.auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
- AUC is 0.99, which indicates that the GBM Oversampling Tuned model has excellent performance. A value close to 1 means the model is almost perfect at distinguishing between positive and negative classes.
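If a single operating point has to be read off the ROC curve without a business-specified recall target, a common heuristic is Youden's J statistic (TPR minus FPR). A small sketch of that choice, as an illustration rather than part of the notebook's pipeline:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Return the score threshold maximizing Youden's J = TPR - FPR,
    i.e. the ROC point farthest above the diagonal."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return float(thresholds[np.argmax(tpr - fpr)])
```

For churn prediction this is only a default: the FN/FP cost asymmetry usually justifies shifting the threshold further toward high recall.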
Actionable Insights and Recommendations¶
Top Features:¶
- total_trans_amt (Total Transaction Amount) and total_trans_ct (Total Transaction Count):
Customers who spend and transact more frequently are likely important for business outcomes (e.g., loyal customers or high-value customers). Recommendation: Focus on these customers with rewards programs, personalized offers, or loyalty initiatives to retain and engage them further.
- total_revolving_bal (Total Revolving Balance):
Customers with a higher revolving balance could indicate frequent credit usage or a reliance on credit. Recommendation: Provide these customers with financial advice, credit management tools, or promotional interest rate offers to reduce their balance.
- total_ct_chng_q4_q1 (Change in Transaction Count from Q4 to Q1) and total_relationship_count:
A large change in transaction counts between quarters might indicate seasonal behavior or changes in customer needs. Customers with multiple product relationships (savings, loans, etc.) are likely to be more engaged. Recommendation: Tailor marketing efforts based on seasonal trends, and offer cross-product promotions to further deepen customer relationships.
- avg_utilization_ratio (Average Utilization Ratio):
High utilization ratios may signal riskier financial behavior or higher credit dependency. Recommendation: Offer these customers debt counseling or credit limit adjustments to improve their financial health.
- months_inactive_12_mon (Months Inactive in Last 12 Months):
Customers who have been inactive for several months are at risk of churn. Recommendation: Re-engage these customers with targeted offers, reminders, or personalized services to bring them back into active usage.
- contacts_count_12_mon (Number of Contacts in the Last 12 Months):
More contact with the customer (e.g., customer service interactions) may indicate either dissatisfaction or a strong relationship. Recommendation: Analyze the nature of these interactions to identify potential pain points or opportunities for enhancing customer support.
Lower Importance Features:¶
- Income Category, Education Level, Marital Status:
These demographic features have much lower importance compared to transaction and account-specific metrics. Actionable Insights: While demographics are useful for segmentation, focus more on behavioral and account-specific features for predictive purposes.
By focusing on transaction activity, account balances, and relationship depth, the business can target specific customer groups for retention, growth, and engagement.
- Churn Prevention: Focus retention efforts on customers who show declining transaction activity or months of inactivity. Proactive engagement could help prevent churn.